Some words to start with

Welcome to the page! This is an essential part of my final assignment for the course “Introduction to Open Data Science”, or “IODS” as we call it among friends. My name is Laura Matkala and I am a PhD student who studies forests. I have to say this is one of the most inspiring courses I have taken in a while. I will do my best to use all the new skills I have learned during the course to make the best possible outcome for this assignment!

Happy forest scientist by a lake in a mountain forest at Mammoth Lakes, CA, USA. (This is here to remind us that although it doesn’t look like it now, the sun actually does exist…)

About the dataset

I chose to use the dataset Boston, which includes data about housing in the suburbs of Boston, Massachusetts, USA. I will later perform linear regression and logistic regression on the variable “crim”, but first some basic information about the dataset.

The dataset has variables related to housing in the suburbs of Boston, Massachusetts, USA. Picture from: http://amtrakdowneaster.com/stations/boston

I have standardized the dataset beforehand and explored it a bit. You can find the R script file with all the data wrangling and code here. The variables in the dataset are:

- crim: per capita crime rate by town
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- indus: proportion of non-retail business acres per town
- chas: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
- nox: nitrogen oxides concentration (parts per 10 million)
- rm: average number of rooms per dwelling
- age: proportion of owner-occupied units built prior to 1940
- dis: weighted mean of distances to five Boston employment centres
- rad: index of accessibility to radial highways
- tax: full-value property-tax rate per $10,000
- ptratio: pupil-teacher ratio by town
- black: 1000(Bk − 0.63)² where Bk is the proportion of blacks by town
- lstat: lower status of the population (percent)
- medv: median value of owner-occupied homes in $1000s
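The wrangling itself lives in the linked script; as a minimal sketch of what it presumably does (assuming the data come from the MASS package and all columns are standardized with scale() before saving):

```r
# Sketch of the data wrangling (assumptions: MASS::Boston is the source,
# scale() does the standardizing; see the linked script for the real steps).
library(MASS)                                   # provides the original Boston data
boston_scaled <- as.data.frame(scale(Boston))   # center each column to mean 0, sd 1
write.csv(boston_scaled, file = "Boston.csv", row.names = FALSE)
```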

Analysis

Linear regression

To start with the linear regression, I need to read in the data and load the needed packages. I will also check the structure and dimensions of the data to make sure everything is in order after the data wrangling and saving the file as csv.

Boston<-read.csv(file = "C:/HY-Data/MATKALA/GitHub/IODS-final/Boston.csv", header = TRUE, sep=",")
library(GGally); library(ggplot2)
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  -0.419 -0.417 -0.417 -0.416 -0.412 ...
##  $ zn     : num  0.285 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  -1.287 -0.593 -0.593 -1.306 -1.306 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.74 -0.74 -0.834 -0.834 ...
##  $ rm     : num  0.413 0.194 1.281 1.015 1.227 ...
##  $ age    : num  -0.12 0.367 -0.266 -0.809 -0.511 ...
##  $ dis    : num  0.14 0.557 0.557 1.077 1.077 ...
##  $ rad    : num  -0.982 -0.867 -0.867 -0.752 -0.752 ...
##  $ tax    : num  -0.666 -0.986 -0.986 -1.105 -1.105 ...
##  $ ptratio: num  -1.458 -0.303 -0.303 0.113 0.113 ...
##  $ black  : num  0.441 0.441 0.396 0.416 0.441 ...
##  $ lstat  : num  -1.074 -0.492 -1.208 -1.36 -1.025 ...
##  $ medv   : num  0.16 -0.101 1.323 1.182 1.486 ...
dim(Boston)
## [1] 506  14

Everything seems to be in order, and the data look the way I intended at this point. Let’s make a couple of plots to see what the data look like.

ggpairs(Boston, lower = list(combo = wrap("facethist", bins = 20)))
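The same correlations shown in the plot matrix can also be checked numerically; a quick sketch (note that standardizing the data does not change the correlations):

```r
# Correlations of every variable with "crim", sorted by absolute value
# (the first entry is crim's correlation with itself, which is 1).
crim_cors <- cor(Boston)[, "crim"]
sort(abs(crim_cors), decreasing = TRUE)
```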

Ok, we see that “rad”, “tax” and “indus” have the highest correlations with “crim”. I will thus use these three as explanatory variables in a multiple linear regression model for “crim”. In this case the model takes the form \(y = \alpha+\beta_1x_1+\beta_2x_2+\beta_3x_3+\epsilon\), where \(x_1\), \(x_2\) and \(x_3\) are rad, tax and indus, the \(\beta\)s are their coefficients, \(\alpha\) is the intercept and \(\epsilon\) is the error term (random noise).

my_model <- lm(crim ~ rad + tax + indus, data = Boston)
summary(my_model)
## 
## Call:
## lm(formula = crim ~ rad + tax + indus, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1835 -0.1694 -0.0288  0.0887  8.8842 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.259e-16  3.473e-02   0.000    1.000    
## rad          5.701e-01  8.591e-02   6.635 8.42e-11 ***
## tax          3.196e-02  9.960e-02   0.321    0.748    
## indus        4.429e-02  5.133e-02   0.863    0.389    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7813 on 502 degrees of freedom
## Multiple R-squared:  0.3932, Adjusted R-squared:  0.3896 
## F-statistic: 108.4 on 3 and 502 DF,  p-value: < 2.2e-16

In linear regression the aim is to minimize the residuals, which are the prediction errors of the model. The best model fit is found so that the sum of the squared residuals is minimized. The \(\beta\) estimates this model gives us are 5.7e-01, 3.2e-02 and 4.4e-02 for rad, tax and indus, respectively, whereas the \(\alpha\) (intercept) is -3.3e-16, i.e. practically zero (as expected for standardized data). The standard error for \(\alpha\) is 3.5e-02, and for the \(\beta\)s 8.6e-02, 1.0e-01 and 5.1e-02, respectively. Based on the p-values it seems that only “rad” is important for the model. I will thus leave “tax” and “indus” out and formulate another model.
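That the fitted coefficients really do minimize the sum of squared residuals can be checked directly; a small sketch, nudging one coefficient away from its fitted value and seeing the residual sum of squares (RSS) grow:

```r
# The least-squares fit minimizes the residual sum of squares (RSS):
fit <- lm(crim ~ rad + tax + indus, data = Boston)
rss_fit <- sum(residuals(fit)^2)   # RSS at the fitted coefficients
b <- coef(fit)
# Predictions with the "rad" coefficient nudged by +0.1:
pred <- b[1] + (b[2] + 0.1) * Boston$rad + b[3] * Boston$tax + b[4] * Boston$indus
rss_nudged <- sum((Boston$crim - pred)^2)
rss_fit < rss_nudged               # TRUE: any nudge increases the RSS
```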

my_model2<- lm(crim ~ rad, data = Boston)
summary(my_model2)
## 
## Call:
## lm(formula = crim ~ rad, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1817 -0.1605 -0.0164  0.0767  8.8860 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.149e-16  3.472e-02       0        1    
## rad          6.255e-01  3.475e-02      18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.781 on 504 degrees of freedom
## Multiple R-squared:  0.3913, Adjusted R-squared:   0.39 
## F-statistic: 323.9 on 1 and 504 DF,  p-value: < 2.2e-16

Well, no dramatic changes compared to the previous model if we look at the residuals. The standard error for “rad” did get smaller than in model 1, though. Then again, standardizing the data made the values of the variables very small. I am not sure the standardization was really needed; I mainly wanted to wrangle the data a bit instead of using it directly as it is.
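As a sanity check on whether the standardization mattered, here is a sketch fitting the same model on both the raw and the standardized data (assuming the original data come from the MASS package): standardizing rescales the coefficients but leaves the fit itself, e.g. \(R^2\) and the t-values, unchanged.

```r
# Compare the same model on raw vs standardized data (assumed source: MASS).
boston_raw    <- MASS::Boston
boston_scaled <- as.data.frame(scale(boston_raw))   # same scaling as the wrangling
fit_raw    <- lm(crim ~ rad, data = boston_raw)
fit_scaled <- lm(crim ~ rad, data = boston_scaled)
# Coefficients differ with the units, but R^2 is identical:
c(raw = summary(fit_raw)$r.squared, scaled = summary(fit_scaled)$r.squared)
```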